This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
This will help us focus on the goal of the code (understanding stats better), instead of the code itself, which is a means to an end.
You can take this notebook with you, which will provide templates to execute various analyses.
install.packages("RColorBrewer")
library(tidyverse)
library(modelr)
library(broom)
Attaching package: ‘broom’
The following object is masked from ‘package:modelr’:
bootstrap
library(epiR)
library(janitor)
We’ll be using a table of simplified school-level data regarding the racial makeup of schools and their scores on the PARCC standardized test.
(It’s outside our purview today, but of course normally we would follow best practice and study the data dictionary/documentation before doing anything else.)
schools_data <- read_csv("ILschools.csv")
Parsed with column specification:
cols(
id = col_character(),
nameSCH = col_character(),
nameDIST = col_character(),
city = col_character(),
county = col_character(),
CHI = col_logical(),
type = col_character(),
pctWhite = col_double(),
pctBlack = col_double(),
pctHisp = col_double(),
pctAsian = col_double(),
pctPCISL = col_double(),
pctNativ = col_double(),
pctMulti = col_double(),
pctLowInc = col_double(),
PARCCpct = col_double()
)
The column types readr guessed seem reasonable. Let’s have a look at the top of our table. Scroll through to the right, paying special attention to our percentage columns.
head(schools_data)
The janitor package fixes a lot of common data hygiene problems. In this case, we’ll change these hideous and inconsistent column names to conform with R conventions.
schools_data <- schools_data %>% clean_names()
head(schools_data)
summary(schools_data)
id name_sch name_dist
Length:3796 Length:3796 Length:3796
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
city county chi
Length:3796 Length:3796 Mode :logical
Class :character Class :character FALSE:3166
Mode :character Mode :character TRUE :630
type pct_white pct_black
Length:3796 Min. : 0.00 Min. : 0.00
Class :character 1st Qu.: 17.88 1st Qu.: 1.00
Mode :character Median : 62.10 Median : 3.50
Mean : 54.29 Mean : 17.84
3rd Qu.: 88.70 3rd Qu.: 17.30
Max. :100.00 Max. :100.00
pct_hisp pct_asian pct_pcisl
Min. : 0.00 Min. : 0.00 Min. :0.00000
1st Qu.: 2.50 1st Qu.: 0.00 1st Qu.:0.00000
Median : 8.60 Median : 0.80 Median :0.00000
Mean :20.29 Mean : 3.72 Mean :0.09049
3rd Qu.:27.02 3rd Qu.: 3.40 3rd Qu.:0.10000
Max. :99.50 Max. :85.20 Max. :5.00000
pct_nativ pct_multi pct_low_inc
Min. : 0.0000 Min. : 0.000 Min. : 0.00
1st Qu.: 0.0000 1st Qu.: 1.100 1st Qu.: 28.60
Median : 0.0000 Median : 2.700 Median : 49.00
Mean : 0.3381 Mean : 3.429 Mean : 51.68
3rd Qu.: 0.3000 3rd Qu.: 4.700 3rd Qu.: 75.40
Max. :48.9000 Max. :39.600 Max. :100.00
parc_cpct
Min. : 0.00
1st Qu.:19.50
Median :31.40
Mean :33.63
3rd Qu.:46.20
Max. :97.00
NA's :811
We can see that the various columns have very different means. We can also see that for several columns the mean and median differ substantially, which tells us the distributions are skewed rather than normal.
We’ll explore these distributions visually with ggplot2, a very powerful graphics library in R. We won’t dwell too much on ggplot syntax here, but I encourage you to keep working on it.
First, let’s look at what’s known as a normal distribution.
set.seed(1)
df <- data.frame(PF = 10*rnorm(10000))
ggplot(df, aes(x = PF)) +
geom_histogram(aes(y = ..density..),
breaks = seq(-50, 50, by = 5),
colour = "black",
fill = "deepskyblue") +
stat_function(fun = dnorm, args = list(mean = mean(df$PF), sd = sd(df$PF)))
print(paste0("mean: ", round(mean(df$PF),0)))
[1] "mean: 0"
print(paste0("median: ", round(median(df$PF),0)))
[1] "median: 0"
In this distribution, the mean (aka ‘average’) and the median are equal, and both sides are symmetrical around them. But this is unusual in the real world, so looking at a variable’s distribution can tell us a lot about the data behind it.
Let’s start with the distribution of test scores, which we’ll make with ggplot.
# First tell ggplot which data and variable to use
c <- ggplot(schools_data, aes(parc_cpct))
# Then we tell ggplot which viz to make
c + geom_histogram(binwidth = 10)
What does this tell us about the distribution of school test-passing rates? Notice that the histogram is lopsided rather than symmetrical; that asymmetry is called skew.
Next, let’s look at the distribution of the percent-Black variable.
# First tell ggplot which data and variable to use
c <- ggplot(schools_data, aes(pct_black))
# Then we tell ggplot which viz to make
c + geom_histogram(binwidth = 10)
What does this shape tell us about schools in Illinois?
What are some other areas of life where we might find this distribution?
You try: make a histogram of the distribution of pct_low_inc
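If you get stuck, one possible solution follows the same recipe as the histograms above:

```r
# Same pattern: data + variable, then a histogram geom
c <- ggplot(schools_data, aes(pct_low_inc))
c + geom_histogram(binwidth = 10)
```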
Let’s explore the relationship between the percent of a school’s students who are low-income, and the percent who are proficient.
First, let’s practice filtering for just elementary schools.
elementary_schools <- schools_data %>% filter(type == "ELEMENTARY")
Now let’s plot the relationship between the percent of a school’s students who are low-income and the percent who achieved proficiency on the PARCC exam.
a <- ggplot(elementary_schools, aes(pct_low_inc, parc_cpct))
a + geom_point(color = "turquoise", alpha = .6)
We can see that in general a higher rate of low-income students at a school is associated with a lower rate of passing the exam.
Let’s formalize this by fitting a line.
a + geom_point(color = "turquoise", alpha = .6) + geom_smooth(method = lm)
low_inc_mod <- lm(parc_cpct ~ pct_low_inc, data = elementary_schools)
summary(low_inc_mod)
Call:
lm(formula = parc_cpct ~ pct_low_inc, data = elementary_schools)
Residuals:
Min 1Q Median 3Q Max
-45.690 -7.288 -0.824 6.851 68.457
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.670894 0.504760 120.20 <2e-16 ***
pct_low_inc -0.492279 0.008203 -60.01 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.45 on 2229 degrees of freedom
(175 observations deleted due to missingness)
Multiple R-squared: 0.6177, Adjusted R-squared: 0.6175
F-statistic: 3602 on 1 and 2229 DF, p-value: < 2.2e-16
Remember y = mx + b ? What does the output above tell us?
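To connect the output to y = mx + b: the intercept is b (about 60.7, the predicted pass rate for a hypothetical school with no low-income students), and the pct_low_inc coefficient is m (each additional percentage point of low-income students is associated with roughly half a point lower pass rate). A quick sketch using the fitted model:

```r
# The slope and intercept from the summary above
coef(low_inc_mod)

# Predicted pass rate for a school where 50% of students are low-income:
# 60.67 + (-0.49 * 50), or about 36 percent
predict(low_inc_mod, newdata = data.frame(pct_low_inc = 50))
```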
You try: Run a linear regression that shows the relationship between the percent white and the percent who passed the exam. What do we learn from this? How do we put it into words?
Does a school being in Chicago affect these results? First, let’s consider this visually.
a <- ggplot(elementary_schools, aes(pct_low_inc, parc_cpct, color = chi))
a + geom_point(alpha = .5)
a + geom_point(alpha = .5) + geom_smooth(method = "lm")
Let’s run a new regression that includes the ‘chi’ variable. And then summarize output.
multi_mod <- lm(parc_cpct ~ pct_low_inc + chi, data = elementary_schools)
summary(multi_mod)
Call:
lm(formula = parc_cpct ~ pct_low_inc + chi, data = elementary_schools)
Residuals:
Min 1Q Median 3Q Max
-46.589 -7.594 -0.537 7.043 72.682
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.070922 0.499632 124.23 <2e-16 ***
pct_low_inc -0.548532 0.009067 -60.50 <2e-16 ***
chiTRUE 8.773219 0.688236 12.75 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.06 on 2228 degrees of freedom
(175 observations deleted due to missingness)
Multiple R-squared: 0.6437, Adjusted R-squared: 0.6434
F-statistic: 2013 on 2 and 2228 DF, p-value: < 2.2e-16
Looking better, but the output is still pretty ugly. We can make it more useful with the broom package.
tidy(multi_mod)
Now it’s a data frame, instead of that weird model object.
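Because tidy() returns an ordinary data frame, we can manipulate it with the usual dplyr verbs, and broom’s glance() gives us model-level statistics as a one-row data frame too. A quick sketch:

```r
# Pull out just the pct_low_inc coefficient row
tidy(multi_mod) %>% filter(term == "pct_low_inc")

# One-row data frame of model-level stats (R-squared, AIC, etc.)
glance(multi_mod)
```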
You practice. Run a linear model where our predictors are pct_white and chi. Then clean up your output using broom.
If we feed our original data frame and our model through the augment() function, it will calculate fitted values for us.
A fitted, or predicted, value is the percent PARCC proficiency our model suggests would be expected for this school given the terms in our model (in this case the percent of a school’s students who are low-income and whether the school is in Chicago).
mod_grid <- lm(parc_cpct ~ pct_low_inc + chi, data = elementary_schools, na.action = "na.exclude") %>% augment(data = elementary_schools)
View(head(mod_grid,20))
The difference between the fitted value and the actual value can be interpreted as whether a school is over- or under-performing. We can plot this.
To keep our plot from being too busy, let’s compute each school’s residual and keep only the outliers: schools that miss their expected score by more than 20 points in either direction.
mod_grid <- mod_grid %>% mutate(residual = .fitted - parc_cpct) %>% filter(residual < -20 | residual > 20)
ggplot(mod_grid, aes(x = pct_low_inc, y = parc_cpct)) +
geom_segment(aes(xend = pct_low_inc, yend = .fitted), alpha = .2) +
geom_point(aes(color = residual),alpha = .8) +
scale_color_gradient2(low = "blue", mid = "white", high = "red") +
guides(color = FALSE) +
geom_point(aes(y = .fitted)) +
facet_grid(~ chi)
Some schools beat (or miss) their expected score dramatically. Because we defined the residual as fitted minus actual, the biggest overperformers have the most negative residuals. Let’s find them.
View(mod_grid %>% arrange(residual))
Story language:
Eisenhower Elementary outscored what would be expected for a similar school with mostly low-income students by 73 percentage points.
Add a column for ‘good school’ where more than half of the students have passed the test.
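A minimal sketch of that step, assuming we call the new column good_school and read “more than half” as a pass rate above 50:

```r
# TRUE when more than half of students passed the PARCC exam
schools_data <- schools_data %>%
  mutate(good_school = parc_cpct > 50)
```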
Now let’s use more than one variable to predict whether a school is “good”.
We can use our summary function again to see the output of our regression.
Let’s unpack what our results mean.
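The code for this step isn’t shown above; here’s a sketch, assuming good_school was created as described and that we use pct_low_inc and chi as our predictors. Because a TRUE/FALSE outcome calls for logistic regression, we swap lm() for glm() with family = binomial. The coefficients come out on the log-odds scale, so exponentiating them gives odds ratios, which are easier to put into words.

```r
# Logistic regression: family = binomial tells glm() this is a yes/no outcome
good_mod <- glm(good_school ~ pct_low_inc + chi,
                data = schools_data, family = binomial)
summary(good_mod)

# Odds ratios: a value of, say, 0.95 for pct_low_inc would mean each extra
# point of low-income share multiplies the odds of being a "good school" by 0.95
exp(coef(good_mod))
```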
You try: Let’s create a variable indicating whether a school is in Cook County or not, called ‘cook’.
Now let’s run a logistic regression with cook and pct_black as our predictors and good_school as our dependent variable, then summarize.
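One possible sketch of those two steps. Note the assumptions: that good_school exists as described above, and that the county column spells the value exactly "Cook" (check your data; it may be "COOK" or similar).

```r
# Flag schools in Cook County (assumes the county column uses "Cook")
schools_data <- schools_data %>%
  mutate(cook = county == "Cook")

cook_mod <- glm(good_school ~ cook + pct_black,
                data = schools_data, family = binomial)
summary(cook_mod)
```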
What do we learn?
You try: Run a logistic regression with chi and pct_white as our predictors, and good_school as our dependent variable. Then summarize.
What do we learn?
What are some other questions we could answer with a logistic regression?
Non-linear relationships
Of course, not all relationships are strictly linear. How do we get a sense of a curve’s shape? It turns out ggplot will help us find it with a different type of geom_smooth.
# Define the x and y variables
a <- ggplot(elementary_schools, aes(pct_low_inc, parc_cpct))
# Plot points
a + geom_point(color = "turquoise", alpha = .6) +
geom_smooth(method = "loess")
Do we think a linear model is appropriate here?
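If the loess curve bends away from a straight line, one option is to add a polynomial term to the model. A sketch, not a recommendation for this particular data:

```r
# Fit a quadratic in pct_low_inc using poly()
curve_mod <- lm(parc_cpct ~ poly(pct_low_inc, 2), data = elementary_schools)
summary(curve_mod)
```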
A straight line looks like a reasonable fit here, but that’s not always the case. Here’s an example from a ProPublica story about redlining in car insurance.
https://www.propublica.org/article/minority-neighborhoods-higher-car-insurance-premiums-methodology